ABSTRACT

Action recognition from still images is an important task in computer vision applications such as image annotation, robotic navigation, video surveillance and several others. Existing approaches mainly rely on either bag-of-features representations or articulated body-part models. However, the relationship between the action and the image segments is still substantially unexplored. For this reason, in this paper we propose to approach action recognition by leveraging an intermediate layer of “superpixels” whose latent classes can act as attributes of the action. In the proposed approach, the action class is predicted by a structural model (learnt by Latent Structural SVM) based on measurements from the image superpixels and their latent classes. Experimental results over the challenging Stanford 40 Actions dataset report an average accuracy of 74.06% for the positive class and 88.50% for the negative class, giving evidence of the performance of the proposed approach.

Index Terms: Action recognition from still images, superpixel segmentation, latent structural SVM.



Fig. 1. Action recognition: bottom layer: superpixel segmentation and feature extraction; intermediate layer: superpixel classification; top variable: action class.


1. INTRODUCTION AND RELATED WORK 

Automated recognition of actions in still images can play an important role in the annotation of image catalogues, including the large collections of images that are increasingly made available by social networks. Actions which can be plausibly recognised from still imagery are those inferrable from the actors’ poses and the presence of relevant objects: examples range from “taking a picture” and “having a barbecue”, to “throwing a javelin” or “playing guitar”. Moreover, recognition from single frames could also prove of fundamental value for recognising actions in video. For instance, in surveillance videos it is not uncommon for an actor to be clearly visible for only a few frames due to repeated occlusions. In such cases, it is not easy to recognise the action in dynamical terms, i.e., as the temporal evolution of a measurement vector. Rather, inference must be obtained as the cumulative evidence from a (possibly small) set of individual frames. In robotics too, the varying camera viewpoint may make it easier to recognise actions from isolated frames than from sequences. Estimation from still frames is therefore a foundational technology for all these cases.

The most straightforward solution to recognise actions from still images is to compute a bag-of-features representation of the image and use it to classify the image into a relevant set of action classes [1][2][3]. Useful features include local texture descriptors such as the histogram of oriented gradients (HOG), dense SIFT, GIST and several others [4][5][6]. Bag-of-features analysis usually discards the spatial coordinates at which descriptors are collected since it focusses on textural rather than spatial or structural information. These approaches have reported promising results on challenging still image action datasets such as those described in [7][8][9]. At the opposite end of the spectrum are approaches based on the explicit recovery of body parts and the incorporation of structural information in the recognition process [10][11]. The baseline model is a latent part-based model akin to Pictorial Structure which can be estimated as a joint, conditional or max-margin model [12][13]. Delaitre et al. have reported a comparison between a bag-of-features and a structural approach, showing that hybridisation of the two can be a way to capture the benefits of both models [1]. A recent survey from Guo and Lai offers a comprehensive outline of the research in this area [14].

Overall, it appears that the existing approaches have not substantially explored the underlying relationship between the action class and the segments of the containing image. For this reason, in this paper we propose to approach action recognition in still images by leveraging the latent classes of the image’s “superpixels” (homogeneous regions obtained from over-segmentation of the image [15]). To this aim, we have designed a graphical model with the action as its root node and a fully-connected layer of superpixel classes to capture the relationship with the image segments (Fig. 1). A rich measurement vector is extracted from each superpixel, and model training is provided by a latent structural SVM approach.

The rest of the paper is organised as follows: Section 2 describes the graphical model for action recognition and the detectors for the superpixels’ classes. Section 3 overviews the latent structural SVM framework. Section 4 describes the experiments and discusses results. Section 5 highlights conclusions and future work.

2. ACTION RECOGNITION BY SUPERPIXEL CLASSIFICATION

The fundamental step taken in our approach is the decomposition of an image into small and coherent patches commonly referred to as superpixels. Our underlying assumption is that certain actions can be recognised effectively from an image by utilising the information carried by its superpixels. Given their homogeneous nature, superpixels can be assigned single class labels such as “sky”, “road”, “face” and others, leading to a form of image (over-)segmentation (Fig. 1, bottom layers). While each superpixel in an image can be classified individually using a trained classifier, a recent paper from Pei et al. [16] has shown that superpixel classification proves more accurate if all the superpixels in the image are classified jointly. Accordingly, our approach consists of two main stages: in the first stage, we pre-train a superpixel classifier, or a set of object detectors, from a supervised set of image regions and we use it to compute class scores for all superpixels in a given image. In the second stage, such scores are used as measurements in a graphical model that provides joint decisions for all superpixels and the action class. In the rest of this section, we describe the graphical model and the object detectors.

2.1. The graphical model 

The proposed graphical model comprises three sets of variables, namely measurements (x), hidden nodes, or states, (h) and an output node (y). The measurements are a vector of detector scores for each superpixel, the hidden nodes are their classes, and the output node is the categorical variable for the action. The nodes are connected by three different types of edges: a) edges connecting measurements and states, b) edges over state pairs, and c) edges between states and the action class. Given that the dependencies between states may extend over the entire image, we assume that the states and their edges form a fully-connected graph. Approximate inference is provided by a greedy algorithm that iteratively maximises over each state in turn.

Denoting the number of superpixels in the $i$-th image of a training set as $T_i$, we have $x_i = [x_i^1, \ldots, x_i^t, \ldots, x_i^{T_i}]$, with $x_i^t$ a $D$-dimensional vector of scores; $h_i = [h_i^1, \ldots, h_i^t, \ldots, h_i^{T_i}]$, with $h_i^t \in \{1, \ldots, K\}$ a superpixel class; and $y_i \in \{0, 1\}$ a binary variable for a given action class. We build one such model for each action class. At training time, the action class is supervised while states are hidden.
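As an illustration of the greedy inference mentioned above, the following is a minimal sketch rather than the authors' implementation. It assumes 0-based class indices, a weight layout split into three hypothetical tables (`w_meas`, `w_pair`, `w_cls`, one per edge type), a fixed action label `y`, and a stopping rule (no state changes, or `max_sweeps` reached) that the paper does not specify.

```python
from typing import List, Sequence


def local_score(t: int, j: int, y: int, h: Sequence[int],
                x: Sequence[Sequence[float]],
                w_meas: Sequence[Sequence[float]],
                w_pair: Sequence[Sequence[float]],
                w_cls: Sequence[Sequence[float]]) -> float:
    """Sum of all edge scores that touch state t when it takes class j."""
    # a) measurement-state edge: class-j weights applied to the score vector x[t]
    score = sum(w * v for w, v in zip(w_meas[j], x[t]))
    # c) state-action edge
    score += w_cls[y][j]
    # b) state-state edges of the fully-connected graph (both orderings)
    for u, k in enumerate(h):
        if u != t:
            score += w_pair[j][k] + w_pair[k][j]
    return score


def greedy_states(x, y, w_meas, w_pair, w_cls, h_init,
                  max_sweeps: int = 10) -> List[int]:
    """Maximise over each state in turn until no state changes."""
    h = list(h_init)
    n_classes = len(w_meas)
    for _ in range(max_sweeps):
        changed = False
        for t in range(len(h)):
            current = local_score(t, h[t], y, h, x, w_meas, w_pair, w_cls)
            for j in range(n_classes):
                s = local_score(t, j, y, h, x, w_meas, w_pair, w_cls)
                if s > current:  # strict improvement only, so ties cannot cycle
                    h[t], current, changed = j, s, True
        if not changed:
            break
    return h


# Toy example (assumed values): three superpixels, two classes.
x_toy = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
w_meas_toy = [[1.0, 0.0], [0.0, 1.0]]   # class j reads its own score
w_zero = [[0.0, 0.0], [0.0, 0.0]]       # no state-state or state-action coupling
w_same = [[1.0, 0.0], [0.0, 1.0]]       # reward pairs of equal states
```

With the coupling switched off (`w_zero`), each state simply follows its own scores; with `w_same`, the third superpixel is pulled towards the class of the other two, which is the kind of joint decision the fully-connected layer is meant to allow.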


2.2. Object detectors 

The first step of our processing pipeline is the decomposition of the image into superpixels. For this task, we use the efficient graph-based segmentation algorithm proposed by Felzenszwalb and Huttenlocher [15] that was also adopted by [16]. This step achieves good over-segmentation of the image into regions of predominantly homogeneous nature, and errors in this process can be tolerated by the ensuing soft-assignment stage. Fig. 1 shows an example of superpixel segmentation.

To classify the superpixels, we have used the class set of the MSRC-21 dataset, consisting of 23 diverse classes of typical background and foreground objects [17]. Similarly to [16], we use a combination of appearance-based and bag-of-features descriptors as the feature vector. The appearance-based descriptor has 51 features, comprising: 1) 40 color features measuring mean, standard deviation, skewness and kurtosis of the RGB, LAB and YCrCb color space channels and the gray image; and 2) 11 texture features obtained from the application of an average filter and five different responses each from Gaussian and Laplacian-of-Gaussian filters. The bag-of-features is obtained by first computing dense SIFT descriptors [18] in the superpixel region at three different scales and then encoding the descriptors into a dictionary of 400 visual words learned by k-means clustering. We concatenate the appearance-based descriptor and the bag-of-features into a 451-D vector, denoted as $s_i^t$ for the $t$-th superpixel of the $i$-th image.
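The color part of the descriptor can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes population (biased) moments, non-excess kurtosis, and that each of the 10 channels (3 RGB, 3 LAB, 3 YCrCb, 1 gray) arrives as a flat list of pixel values for one superpixel. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def channel_moments(values: Sequence[float]) -> List[float]:
    """Mean, standard deviation, skewness and kurtosis of one channel."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        # Constant region: higher moments are undefined, report zeros.
        return [mean, 0.0, 0.0, 0.0]
    skew = sum((v - mean) ** 3 for v in values) / (n * std ** 3)
    kurt = sum((v - mean) ** 4 for v in values) / (n * std ** 4)
    return [mean, std, skew, kurt]


def color_features(channels: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the four moments of every channel."""
    feats: List[float] = []
    for channel in channels:
        feats.extend(channel_moments(channel))
    return feats


# Dimensionality check: 10 channels x 4 moments = 40 color features,
# plus 11 texture features and 400 visual words gives the 451-D vector.
N_COLOR, N_TEXTURE, N_WORDS = 10 * 4, 11, 400
DESCRIPTOR_SIZE = N_COLOR + N_TEXTURE + N_WORDS
```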

Once the feature vector is extracted, a superpixel classifier is built by multiclass SVM [19][20] with a linear kernel trained over the MSRC-21 dataset. Please note that this dataset is separate from the action dataset and that the object detectors are not re-trained. Once the classifier is trained, for every measurement, $s_i^t$, we compute and collect the probability scores of all 23 classes as a feature vector, $x_i^t$:

$$x_i^t = \left[\, p(k \mid s_i^t) \propto \exp(w_k^\top s_i^t) \,\right], \quad \forall k \tag{1}$$

where $w_k$ denotes the $k$-th class’ parameter vector of the multiclass SVM classifier. Such a vector of posterior probabilities will later be used as the superpixel’s measurement for action classification, exploiting the semantics of the object classes and reducing the measurement dimensionality from 451-D to 23-D.
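A small sketch of how such class scores could be turned into a probability vector is given here. The paper only states proportionality to the exponentiated linear score; normalising with a softmax over the classes is an assumption of this example, and `class_posteriors` is a hypothetical name.

```python
import math
from typing import List, Sequence


def class_posteriors(weights: Sequence[Sequence[float]],
                     s: Sequence[float]) -> List[float]:
    """Return p(k | s) proportional to exp(w_k . s), normalised to sum to 1."""
    scores = [sum(w * v for w, v in zip(w_k, s)) for w_k in weights]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage with 3 classes and a 2-D descriptor (the paper uses 23 and 451).
toy_weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
toy_scores = class_posteriors(toy_weights, [math.log(2.0), 0.0])
```

In the toy call the first class has a linear score of log 2 and the others 0, so the first entry receives twice the mass of each of the other two.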

3. LATENT STRUCTURAL SVM 

For action classification, we wish to learn the following linear prediction function:

$$(\hat{y}, \hat{h}) = f_w(x) = \arg\max_{y,h} \left[ w^\top \psi(x, h, y) \right] \tag{2}$$

where $\psi(x, h, y)$ is a generalised feature function computing a combined map over measurements $x$, states $h$ and class $y$, and $w$ is a corresponding parameter vector. The action class and superpixel states are predicted jointly and typically only the predicted class, $\hat{y}$, is retained. For parameter estimation, we adopt the well-established latent structural SVM framework [21]. This is a regularised minimum-risk framework guaranteed to provide a local optimum for structural models with latent variables. Its learning objective:

$$w^* = \arg\min_{w,\,\xi_{1:N}} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$$

$$\text{s.t.} \quad w^\top \psi(x_i, h_i^*, y_i) - w^\top \psi(x_i, h, y) \geq 1 - \xi_i \tag{3}$$

$$h_i^* = \arg\max_{h} \; w^{*\top} \psi(x_i, h, y_i) \tag{4}$$

is an iterative objective that alternates between the constrained optimisation in (3), performed using the current values for latent variables $h_i^*$, and a new assignment for $h_i^*$ in (4) from the updated model $w^*$. Implementation requires a number of design choices including the definition of a suitable feature function, $\psi(x, h, y)$, the initialisation of latent variables $h$, and efficient algorithms for inference and augmented inference. These components are presented in the following sub-sections.

The learning procedure in (3)(4) can be initialised by either an arbitrary vector $w^*$ in (4) or an arbitrary assignment for the $h_i^*$ in (3). Given that we have trained a multiclass superpixel classifier to obtain the feature set, the most natural choice is to initialise the states with the prediction from this classifier:

$$h_i^{t*} = \arg\max_{k} \left[ p(k \mid s_i^t) \right] \tag{5}$$

The above is equivalent to initialising the states with individual predictions, delegating the discovery of correlations to the training stage.

3.1. Feature function and score function

The features in feature function $\psi(x, h, y)$ reflect the topology of the graphical model that includes an edge between each superpixel’s measurement and its state variable, a fully-connected graph amongst states, and an edge between each state and the action class. In detail, $\psi(x, h, y)$ breaks into:

• measurement features, $\varphi(x^t, h^t = j)$: these features map the measurement vector of the $t$-th superpixel, $x^t$ (dimensionality: $K$), to its state, $h^t$ (possible values: $\{1, \ldots, K\}$). The size of this feature vector is $K^2$ and, given $h^t = j$, it consists of $x^t$ starting at index $K(j - 1)$ and zero-padding elsewhere;

• state features, $\delta(h^t = j, h^u = k)$: these features report the co-occurrence of states $h^t = j$ and $h^u = k$. The size of this feature vector is again $K^2$, with a value 1 at index $K(j - 1) + k - 1$ and zeros elsewhere;

• class features, $\phi(y = b \in \{0, 1\}, h^t = j)$: these features report the co-occurrence of action class $y = b$ and state $h^t = j$. The size of this feature vector is $2K$, with a value 1 at index $bK + j - 1$ and zeros elsewhere.

We refer the reader to [20] for further details on feature maps. Given such feature vectors, the score function in (2) is computed as:

$$w^\top \psi(x, h, y) = \sum_{t=1}^{T} w_\varphi^\top \varphi(x^t, h^t) + \sum_{t=1}^{T} \sum_{\substack{u=1 \\ u \neq t}}^{T} w_\delta^\top \delta(h^t, h^u) + \sum_{t=1}^{T} w_\phi^\top \phi(y, h^t) \tag{6}$$

where $w = [w_\varphi^\top \; w_\delta^\top \; w_\phi^\top]^\top$ is the concatenation of the parameter vectors for the corresponding features.
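The three feature types listed in this sub-section can be sketched in a few lines. The example below is a hedged illustration, not the authors' code: it follows the index rules stated in the text (1-based states `j`, `k`; 0-based vector positions), assumes each measurement vector has length K, and stacks the three summed blocks in the order measurement, state, class. All function names are hypothetical.

```python
from typing import List, Sequence


def measurement_feature(x_t: Sequence[float], j: int, K: int) -> List[float]:
    """K*K vector holding x_t from index K*(j-1), zeros elsewhere."""
    vec = [0.0] * (K * K)
    start = K * (j - 1)
    vec[start:start + K] = list(x_t)
    return vec


def state_feature(j: int, k: int, K: int) -> List[float]:
    """K*K indicator with a 1 at index K*(j-1) + k - 1."""
    vec = [0.0] * (K * K)
    vec[K * (j - 1) + k - 1] = 1.0
    return vec


def class_feature(b: int, j: int, K: int) -> List[float]:
    """2*K indicator with a 1 at index b*K + j - 1."""
    vec = [0.0] * (2 * K)
    vec[b * K + j - 1] = 1.0
    return vec


def _add(a: List[float], b: List[float]) -> List[float]:
    return [p + q for p, q in zip(a, b)]


def joint_feature(x, h: Sequence[int], y: int, K: int) -> List[float]:
    """Sum each feature type over the graph and concatenate the three blocks."""
    meas, pair, cls = [0.0] * (K * K), [0.0] * (K * K), [0.0] * (2 * K)
    for t in range(len(h)):
        meas = _add(meas, measurement_feature(x[t], h[t], K))
        cls = _add(cls, class_feature(y, h[t], K))
        for u in range(len(h)):
            if u != t:  # fully-connected graph, ordered pairs
                pair = _add(pair, state_feature(h[t], h[u], K))
    return meas + pair + cls


def score(w: Sequence[float], x, h: Sequence[int], y: int, K: int) -> float:
    """Linear score: dot product of the weights with the joint feature."""
    return sum(wi * fi for wi, fi in zip(w, joint_feature(x, h, y, K)))
```

For K = 2, two superpixels with states (1, 2) and action label 1, `joint_feature` returns a 12-dimensional vector (4 + 4 + 4), and `score` reduces to the sum of the per-edge terms.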


3.2. Loss-augmented inference 

Following the structured learning approach of [21], a fundamental step in the learning procedure is the computation of a loss-augmented version of the inference. As loss function, we simply use the 0-1 loss function:

$$\Delta(y_{gt}, y) = \begin{cases} 0 & \text{if } y = y_{gt} \\ 1 & \text{if } y \neq y_{gt} \end{cases}$$

where $y_{gt}$ represents the ground-truth label. With this choice, it can easily be seen that the loss-augmented inference:

$$(\hat{y}, \hat{h}) = \arg\max_{y,h} \left[ w^\top \psi(x, h, y) + \Delta(y_{gt}, y) \right] \tag{7}$$

is equivalent to the standard inference in (2) with the addition of a unit score over the incorrect class.
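The equivalence stated above can be made concrete with a short sketch. It assumes that the best score over the states has already been computed for each action label by the standard inference and is passed in as a hypothetical mapping `best_score_for_class`; only the choice of the label is shown.

```python
from typing import Dict, Sequence


def zero_one_loss(y_gt: int, y: int) -> int:
    """0-1 loss: 0 for the ground-truth label, 1 for any other label."""
    return 0 if y == y_gt else 1


def loss_augmented_label(best_score_for_class: Dict[int, float],
                         y_gt: int,
                         classes: Sequence[int] = (0, 1)) -> int:
    """Pick the label maximising score plus loss.

    best_score_for_class[y] is assumed to hold max over h of w . psi(x, h, y).
    Adding the loss amounts to giving a unit bonus to the incorrect label.
    """
    return max(classes,
               key=lambda y: best_score_for_class[y] + zero_one_loss(y_gt, y))


# Toy scores (assumed): the model prefers label 0 by a margin of 0.5.
toy_best = {0: 2.0, 1: 1.5}
```

With `toy_best` and ground truth 0, the standard inference would return 0, while the loss-augmented version returns 1 because the margin is smaller than one; this is the most violated constraint that the learning step then tries to satisfy.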

4. EXPERIMENTAL RESULTS 

We have evaluated the proposed approach on the most challenging static action recognition dataset released to date, Stanford 40 Actions [22]. This dataset contains images of humans performing 40 different classes of actions, including visually-challenging cases such as “fixing a bike” versus “riding a bike” or “phoning” versus “texting message”; the full class list is provided in the annotation of Fig. 2. The number of samples per class varies between 180 and 300, for a total of 9,532 images. A standard training/test split is made available by the authors on their website, selecting 100 images from each class for training and leaving the remaining images for testing.

Training single-class classifiers, also referred to as detectors, using all the available training samples leads to a very unbalanced set over the positive and negative classes (100 and 3,900 samples, respectively). In the case of a conventional SVM objective as in (3), this biases the prediction function towards negative predictions. For this reason and in order to save learning time, we decided to sub-sample the negative training samples of each classifier by randomly choosing 5 images from each of its 39 negative classes. Parameter C in (3) was set to 1.
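The sub-sampling step can be sketched as below. This is an illustrative example under stated assumptions: the training images are held in a hypothetical dictionary from class name to image identifiers, sampling is without replacement, and the seed is fixed only to make the example repeatable.

```python
import random
from typing import Dict, List


def subsample_negatives(images_by_class: Dict[str, List[str]],
                        positive_class: str,
                        per_class: int = 5,
                        seed: int = 0) -> List[str]:
    """Draw `per_class` random images from every class except the positive one."""
    rng = random.Random(seed)
    negatives: List[str] = []
    for name in sorted(images_by_class):
        if name == positive_class:
            continue
        negatives.extend(rng.sample(images_by_class[name], per_class))
    return negatives


# Toy stand-in for the training split: 40 classes with 100 images each.
toy_split = {
    f"action_{c:02d}": [f"action_{c:02d}_{i:03d}.jpg" for i in range(100)]
    for c in range(40)
}
toy_negatives = subsample_negatives(toy_split, "action_07")
```

For each detector this yields 39 x 5 = 195 negative samples against the 100 positive ones, instead of the original 3,900.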


Fig. 2. Achieved accuracy over the positive and negative class for all classes in the Stanford 40 Actions dataset.

When measuring the performance of a detector, it would be trivial to achieve high overall accuracies by always predicting the negative class. Therefore, a dataset like Stanford 40 Actions requires measuring the accuracy for the positive and negative classes separately, or providing measures such as precision and recall, average precision at various levels of recall, or similar. In this work, we decided to report the accuracy for the positive and negative classes as the main figure, as shown in Fig. 2. The mean accuracy over all the 40 classes proved 74.06% for the positive class and 88.50% for the negative class. These results indicate that the individual detectors are well balanced over positive and negative predictions and show that the overall accuracy is much higher, for instance, than that recently reported in [23] (55.93%). For the individual classes, some classes reach very high accuracy over both positive and negative samples. For instance, class “rowing a boat” achieves 93.90% and 96.15% accuracy, respectively. Other classes show significant missed detections: for instance, class “running” achieves only 48.99% accuracy over the positive samples, arguably as it is hard to recognise a running action from a static frame. Overall, these results seem even more remarkable considering that the 23 object detectors were trained on a completely separate dataset (no re-training was performed on the action dataset) and from classes mostly unrelated to the objects portrayed in Stanford 40 Actions (such as bikes, phones, cameras, telescopes, microscopes and several others).
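The reason for reporting the two accuracies separately can be illustrated with a short sketch. The function name and the toy labels are assumptions of this example; the toy set only mirrors the 100 : 3,900 imbalance mentioned for the training data and is not taken from the paper's results.

```python
from typing import Sequence, Tuple


def pos_neg_accuracy(y_true: Sequence[int],
                     y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (accuracy on positives, accuracy on negatives, overall accuracy)."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    pos_acc = sum(1 for p in pos if p == 1) / len(pos)
    neg_acc = sum(1 for p in neg if p == 0) / len(neg)
    overall = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return pos_acc, neg_acc, overall


# A detector that always answers "negative" on an imbalanced toy set.
toy_true = [1] * 100 + [0] * 3900
toy_pred = [0] * 4000
toy_pos, toy_neg, toy_overall = pos_neg_accuracy(toy_true, toy_pred)
```

Here the overall accuracy is 97.5% even though no positive sample is detected, which is why the overall figure alone would be misleading.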


5. CONCLUSION 

In this paper, we have proposed an approach to action recognition in still images leveraging an intermediate superpixel representation of the image. The approach consists of two stages: in a first stage, the image is segmented into a set of superpixels and an array of trained object detectors is applied to each superpixel to extract a vector of detector scores. In a second stage, the score vectors are used as measurements in a graphical model that jointly predicts the superpixels’ classes together with the action class. Experiments conducted over the highly challenging Stanford 40 Actions dataset have resulted in an accuracy of 74.06% for the positive class and 88.50% for the negative class, averaged over the 40 action classifiers. These results give evidence of the existence of a useful relationship between the classes of the image superpixels and that of the main action. Possible ways to further improve the performance of the proposed model would be to adopt a larger, more universal set of object detectors, or a set of detectors tuned to the specific object classes of given action datasets.

6. REFERENCES 

[1] V. Delaitre, I. Laptev, and J. Sivic, “Recognizing human actions in still images: a study of bag-of-features and part-based representations,” in Proceedings of the British Machine Vision Conference, 2010, pp. 1-11.

[2] Nazli Ikizler, R. Gokberk Cinbis, Selen Pehlivan, and Pinar Duygulu, “Recognizing actions from still images,” in Proceedings of the 19th International Conference on Pattern Recognition, ICPR 2008, 2008, pp. 1-4.

[3] I. Laptev, “On space-time interest points,” International Journal of Computer Vision, vol. 64, no. 2, pp. 107-123, 2005.

[4] Navneet Dalal and Bill Triggs, “Histograms of oriented gradients for human detection,” in International Conference on Computer Vision & Pattern Recognition, June 2005, vol. 2, pp. 886-893.

[5] D.G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.

[6] Aude Oliva and Antonio Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International Journal of Computer Vision, vol. 42, no. 3, pp. 145-175, May 2001.

[7] N. Ikizler-Cinbis, R.G. Cinbis, and S. Sclaroff, “Learning actions from the web,” in 2009 IEEE 12th International Conference on Computer Vision. IEEE, 2009, pp. 995-1002.

[8] A. Gupta, A. Kembhavi, and L.S. Davis, “Observing human-object interactions: Using spatial and functional compatibility for recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 10, pp. 1775-1789, Oct. 2009.

[9] Bangpeng Yao and Li Fei-Fei, “Grouplet: A structured image representation for recognizing human and object interactions,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, 2010, pp. 9-16.

[10] Weilong Yang, Yang Wang, and Greg Mori, “Recognizing human actions from still images with latent poses,” in CVPR, 2010, pp. 2030-2037.

[11] Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas J. Guibas, and Li Fei-Fei, “Action recognition by bases of action attributes and parts,” in International Conference on Computer Vision (ICCV), 2011, pp. 1331-1338.

[12] M. Fischler and R. Elschlager, “The representation and matching of pictorial structures,” IEEE Transactions on Computers, vol. 22, no. 1, pp. 67-92, Jan. 1973.

[13] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627-1645, Sept. 2010.

[14] Guodong Guo and Alice Lai, “A survey on still image based human action recognition,” Pattern Recognition, in press, 2014.

[15] Pedro F. Felzenszwalb and Daniel P. Huttenlocher, “Efficient graph-based image segmentation,” International Journal of Computer Vision, vol. 59, no. 2, pp. 167-181, 2004.

[16] Deli Pei, Zhenguo Li, Rongrong Ji, and Fuchun Sun, “Efficient semantic image segmentation with multi-class ranking prior,” Computer Vision and Image Understanding, 2013.

[17] A. Criminisi, “Microsoft Research Cambridge object recognition image database,” http://research.microsoft.com/en-us/projects/objectclassrecognition/, 2004.

[18] A. Vedaldi and B. Fulkerson, “VLFeat: An open and portable library of computer vision algorithms,” http://www.vlfeat.org/, 2008.

[19] Koby Crammer and Yoram Singer, “On the algorithmic implementation of multiclass kernel-based vector machines,” The Journal of Machine Learning Research, vol. 2, pp. 265-292, 2002.

[20] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, “Large margin methods for structured and interdependent output variables,” JMLR, vol. 6, pp. 1453-1484, 2005.

[21] Chun-Nam John Yu and Thorsten Joachims, “Learning structural SVMs with latent variables,” in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 1169-1176.

[22] Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas Guibas, and Li Fei-Fei, “Human action recognition by learning bases of action attributes and parts,” in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 1331-1338.

[23] Fadime Sener, Cagdas Bas, and Nazli Ikizler-Cinbis, “On recognizing actions in still images via multiple features,” in Computer Vision - ECCV 2012. Workshops and Demonstrations, 2012, vol. 7585 of Lecture Notes in Computer Science, pp. 263-272.



